Covid19 Analysis
The objective of this analysis is to explore COVID-19 case data and forecast daily new cases. I downloaded the data from Our World in Data.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from IPython.display import HTML
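import logging
# Package name assumed from the 'fbprophet' logger below; newer releases are published as 'prophet'.
from fbprophet import Prophet
from fbprophet.plot import add_changepoints_to_plot
logging.getLogger('fbprophet').setLevel(logging.INFO)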
input_file = 'owid-covid-data.csv'
df = pd.read_csv(input_file)
Next, let's explore the data a bit to understand it better.
- The 'date' column is stored as strings, which is not suitable for the time series analysis later on. So, I created another column, 'ds', that holds the dates in datetime format.
- The location column also contains aggregates such as continents and 'World'. Since we are only interested in countries, we drop these values from the dataframe.
df['ds'] = pd.to_datetime(df['date'], format="%d/%m/%Y")
df = df.sort_values(by=['ds'])
valuesToDrop = ['Asia', 'World', 'International', 'European Union', 'Europe', 'North America', 'Africa',
                'South America', 'Oceania']
df1 = df[~(df['location'].isin(valuesToDrop))]
df1 = df1.dropna(subset=['new_cases'])
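To show why the two cleaning steps above matter, here is a minimal sketch on a made-up toy table. The rows and numbers are invented for illustration and are not taken from the OWID file; only the column names follow the ones used above.

```python
import pandas as pd

# Toy rows shaped like the OWID file (all values are made up).
toy = pd.DataFrame({
    "location": ["World", "Asia", "India", "Japan", "India", "Japan"],
    "date": ["01/03/2020", "01/03/2020", "01/03/2020",
             "01/03/2020", "02/03/2020", "02/03/2020"],
    "new_cases": [30.0, 30.0, 20.0, 10.0, 25.0, None],
})

# Parse day-first strings into real timestamps, as done for the 'ds' column.
toy["ds"] = pd.to_datetime(toy["date"], format="%d/%m/%Y")

# Drop aggregate rows and rows without a case count.
aggregates = ["World", "Asia"]
countries = toy[~toy["location"].isin(aggregates)].dropna(subset=["new_cases"])

first_day = pd.Timestamp("2020-03-01")
total_with_aggregates = toy.loc[toy["ds"] == first_day, "new_cases"].sum()
total_countries_only = countries.loc[countries["ds"] == first_day, "new_cases"].sum()

print(total_with_aggregates, total_countries_only)
```

With the aggregate rows left in, the same cases are counted once per country, once for the continent and once for the world, so any daily total built from the unfiltered frame would be inflated.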
First, I want to see how cases have changed over time.
df2 = df.copy()
# Animated world map: one frame per date, shaded by the number of new cases.
fig = px.choropleth(df2,
                    locations='iso_code',
                    color='new_cases',
                    hover_name='location',
                    animation_frame="date",
                    color_continuous_scale=px.colors.sequential.Reds)
fig.show()
Time series analysis is a useful tool in data analytics. It suits data that are ordered in time, such as stock prices or weather measurements, and it makes it possible to forecast future values from previous ones. In the section below, I try to predict cases using Facebook's Prophet library.
The Prophet library requires two columns, 'ds' and 'y'. The 'ds' column contains the datetime, and the 'y' column contains the value we want to predict. Before we can do any analysis, we first put the data into the format Prophet expects.
Each row in the dataframe shows the number of new cases for one country on one day. We aggregate the data so that each row holds the total new cases on a particular day, and sort it by the datestamp.
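# Aggregate the country-only frame (df1); df still contains 'World' and continent rows,
# which would count the same cases more than once.
df4 = df1.groupby(by=['ds'])['new_cases'].sum().reset_index().sort_values(by='ds', ascending=True)
df4 = df4.rename(columns={'new_cases': 'y'})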
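Before fitting, it can help to confirm that the aggregated frame really has the shape described above. The helper below is a hypothetical sketch of mine, not part of Prophet; the duplicate and ordering checks reflect the assumptions of this analysis (one row per day, sorted by date) rather than documented Prophet requirements.

```python
import pandas as pd

def check_prophet_frame(frame):
    """Hypothetical helper: list the problems found in a ds/y frame."""
    problems = [f"missing column: {col}" for col in ("ds", "y")
                if col not in frame.columns]
    if problems:
        return problems
    if not pd.api.types.is_datetime64_any_dtype(frame["ds"]):
        problems.append("ds is not a datetime column")
        return problems
    if frame["ds"].duplicated().any():
        problems.append("ds contains duplicate dates")
    if not frame["ds"].is_monotonic_increasing:
        problems.append("ds is not sorted")
    return problems

# Toy frames with made-up values.
good = pd.DataFrame({"ds": pd.date_range("2021-01-01", periods=3, freq="D"),
                     "y": [5.0, 7.0, 6.0]})
bad = pd.DataFrame({"ds": ["01/01/2021", "02/01/2021"], "y": [5.0, 7.0]})

print(check_prophet_frame(good))
print(check_prophet_frame(bad))
```

In this notebook, the equivalent call would be check_prophet_frame(df4).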
Then, we can fit a model on the dataset.
model = Prophet().fit(df4)
future = model.make_future_dataframe(periods=365)
forecast = model.predict(future)
Here we make a forecast for the next 365 days. Prophet requires a future dataframe that extends the specified number of days into the future. The predict() method then predicts the values. The future dataframe includes both the historical dates and the future dates, so the forecast covers both. Note that the longer the forecast horizon, the less accurate the prediction, because the uncertainty grows with it.
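To make the idea of a future dataframe concrete, here is a plain-pandas approximation of what it looks like for daily data. This is only a sketch of the resulting shape on made-up dates, not Prophet's own implementation.

```python
import pandas as pd

# Five made-up historical days and a three-day horizon.
history = pd.DataFrame({"ds": pd.date_range("2021-01-01", periods=5, freq="D")})
horizon = 3

# Future dates start the day after the last observation.
last_date = history["ds"].max()
future_dates = pd.date_range(last_date + pd.Timedelta(days=1),
                             periods=horizon, freq="D")

# Historical dates followed by the future dates, in a single 'ds' column.
future_sketch = pd.concat([history[["ds"]], pd.DataFrame({"ds": future_dates})],
                          ignore_index=True)

print(future_sketch)
```

Because the historical dates are kept, the prediction step returns fitted values for the past as well as forecasts for the future, which is what allows the plot to overlay the model on the observed data.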
fig = model.plot(forecast)
ax = fig.gca()
ax.set_title("Covid cases projection", size=16)
ax.set_xlabel('Date', size=10)
ax.set_ylabel('Cases', size=10)
ax.tick_params(axis="x", labelsize=10)
ax.tick_params(axis="y", labelsize=10)
ax.yaxis.get_major_formatter().set_scientific(False)
g1 = add_changepoints_to_plot(fig.gca(), model, forecast)
The model does not fit the data well! I used the default settings of this model. Let's tune some hyperparameters to see if the fit improves.
I set weekly_seasonality to False because I don't think there is seasonality on a weekly basis. I also set changepoint_prior_scale to 0.4 (the default is 0.05); increasing it makes the trend more flexible. The seasonality mode is set to multiplicative.
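As a rough illustration of what the seasonality mode changes, the sketch below compares the two modes on a synthetic series. The trend and the yearly pattern are made up; the only point is that an additive seasonal swing stays the same size, while a multiplicative one scales with the trend.

```python
import numpy as np

t = np.arange(730)                          # two years of daily steps
trend = 100.0 + 2.0 * t                     # made-up rising baseline
shape = np.sin(2 * np.pi * t / 365)         # made-up yearly pattern in [-1, 1]

additive = trend + 20.0 * shape             # swing is a fixed number of cases
multiplicative = trend * (1 + 0.2 * shape)  # swing is a fixed fraction of the trend

def max_swing(series, start, stop):
    """Largest distance from the trend within the given slice of days."""
    return np.abs(series[start:stop] - trend[start:stop]).max()

add_year1, add_year2 = max_swing(additive, 0, 365), max_swing(additive, 365, 730)
mul_year1, mul_year2 = max_swing(multiplicative, 0, 365), max_swing(multiplicative, 365, 730)

print(add_year1, add_year2)
print(mul_year1, mul_year2)
```

Case counts that swing more widely as the overall level rises are the kind of pattern the multiplicative mode is meant for. The refit with these settings follows.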
m = Prophet(weekly_seasonality=False, changepoint_prior_scale=0.4, seasonality_mode='multiplicative').fit(df4)
future = m.make_future_dataframe(periods=365)
forecast = m.predict(future)
fig1 = m.plot(forecast)
ax = fig1.gca()
ax.set_title("Covid cases projection", size=16)
ax.set_xlabel('Date', size=10)
ax.set_ylabel('Cases', size=10)
ax.tick_params(axis="x", labelsize=10)
ax.tick_params(axis="y", labelsize=10)
ax.yaxis.get_major_formatter().set_scientific(False)
# Use the tuned model m here, not the earlier default model.
g1 = add_changepoints_to_plot(fig1.gca(), m, forecast)
The model now fits the data much better and is able to capture the trend.
One interesting point is that, according to this model, new cases would fall to 0 around June 2022.
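The fit can also be put into numbers instead of being judged from the plot alone. Below is a minimal sketch with a hypothetical helper of mine, in_sample_errors, that joins the observed 'y' values to the 'yhat' column of a forecast and reports two common error measures. The toy numbers are made up.

```python
import pandas as pd

def in_sample_errors(actual, forecast):
    """Hypothetical helper: compare y with yhat on the dates present in both frames."""
    merged = actual.merge(forecast[["ds", "yhat"]], on="ds", how="inner")
    residual = merged["y"] - merged["yhat"]
    return {"mae": residual.abs().mean(),
            "rmse": (residual ** 2).mean() ** 0.5}

# Toy data: three observed days, and a forecast that also covers one future day.
dates = pd.date_range("2021-01-01", periods=4, freq="D")
actual = pd.DataFrame({"ds": dates[:3], "y": [10.0, 20.0, 30.0]})
toy_forecast = pd.DataFrame({"ds": dates, "yhat": [12.0, 18.0, 33.0, 40.0]})

errors = in_sample_errors(actual, toy_forecast)
print(errors)
```

In this notebook, the equivalent call would be in_sample_errors(df4, forecast). Since these errors are measured on the same data the model was trained on, they are likely to look better than the errors on unseen dates.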
Citation: Hannah Ritchie, Edouard Mathieu, Lucas Rodés-Guirao, Cameron Appel, Charlie Giattino, Esteban Ortiz-Ospina, Joe Hasell, Bobbie Macdonald, Diana Beltekian and Max Roser (2020) - "Coronavirus Pandemic (COVID-19)". Published online at OurWorldInData.org. Retrieved from: 'https://ourworldindata.org/coronavirus' [Online Resource]